Introduction

The task is to provide detailed insights into the different marketing campaigns in a market.

a). Give an overview of the entire market’s development and the different campaigns, and summarize the most important findings.

b). Assess the development of the quality of traffic. How does it develop overall, and how does each campaign evolve?

c). Advise a business developer who wants to spend an additional 250€ per week from week 31 onwards on a marketing campaign.

• Advise the business developer in which campaign to invest, and why.

• How does this impact the overall performance in the market from week 31 onwards?

Marketing campaign dataset

# Import the dataset
df <- read.csv("marketing_campaigns.csv",header=TRUE,stringsAsFactors = F, sep=";")
summary(df)
##       Week         Campaign             Visits         Revenue       
##  Min.   : 1.00   Length:91          Min.   : 27.0   Min.   :  2.269  
##  1st Qu.: 8.00   Class :character   1st Qu.:144.0   1st Qu.:122.557  
##  Median :16.00   Mode  :character   Median :158.0   Median :241.206  
##  Mean   :15.66                      Mean   :213.8   Mean   :236.743  
##  3rd Qu.:23.00                      3rd Qu.:233.0   3rd Qu.:357.413  
##  Max.   :30.00                      Max.   :613.0   Max.   :463.249  
##       Cost        
##  Min.   :  3.764  
##  1st Qu.:128.521  
##  Median :235.322  
##  Mean   :241.537  
##  3rd Qu.:348.419  
##  Max.   :507.522
From the summary above we can see that the minimum number of visits is 27 and the maximum is 613. Similarly, the minimum revenue is 2.269, the mean is 236.743 and the maximum is 463.249. The minimum cost is 3.764, the mean is 241.537 and the maximum is 507.522. The dataset covers 30 weeks of campaign data and consists of 5 variables: Week, Campaign, Visits, Revenue and Cost.

Problem Statement (a): Overview of entire market’s development and the different campaigns.

# Inspect the dataset
str(df)
## 'data.frame':    91 obs. of  5 variables:
##  $ Week    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Campaign: chr  "Aldebaran" "Aldebaran" "Aldebaran" "Aldebaran" ...
##  $ Visits  : int  27 64 80 93 120 130 146 173 170 218 ...
##  $ Revenue : num  2.27 10.82 7.13 11.09 14.28 ...
##  $ Cost    : num  3.76 15.32 10.75 16.91 21.45 ...
# Shows the dimensions
dim(df)  
## [1] 91  5
# shows the variable names
names(df) 
## [1] "Week"     "Campaign" "Visits"   "Revenue"  "Cost"
# shows the top 6 obs
head(df) 
# shows the last 6 obs
tail(df)
It looks like there is a duplicate in the 91st row for week 30. We will check this, but first let’s see whether there is any missing data.
# let's check for missing data; summing the logical matrix counts the NAs
sum(is.na(df))
## [1] 0

There is no missing data. Now let’s check for duplicates.

# let's check for the duplicates
duplicated(df)
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
The 91st row is a duplicate: it repeats week 30 of the Cottington campaign. We need to remove it.
df <- df[-91,] # remove the duplicated 91st row
tail(df) # Let's check again
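Removing a hard-coded row index only works while the duplicate sits exactly at row 91. A more general sketch is to filter on `duplicated()` itself, demonstrated here on a small toy data frame (`toy` is hypothetical, for illustration only):

```r
# Toy example: the third row repeats the second
toy <- data.frame(Week = c(29, 30, 30), Campaign = "Cottington")
dedup <- toy[!duplicated(toy), ]   # keeps only the first occurrence of each row
nrow(dedup)                        # 2

# The same idea applied to the campaign data:
# df <- df[!duplicated(df), ]
```

This keeps working even if the file is re-exported and the duplicate lands on a different row.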

1. Visualizing the dataset to get some insight into the campaigns over the 30 weeks using the Plotly package. Plotly generates interactive charts: if we move the cursor over a chart, we can see the values.

There are features like zooming in and out, box select, selecting and deselecting variables, downloading the chart, etc.

NOTE: To see the values, move the cursor over the chart.
library(plotly)
library(dplyr)   # for mutate(), group_by() and summarise() used below
library(tidyr)   # for pivot_longer() used below
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

1.1 Visualize the visits per week for each campaign

p <- plot_ly(df, x = ~Week, y=~Visits, color = ~Campaign)
p

1. Aldebaran Campaign

The plot shows that the Aldebaran campaign had only 27 visits in the first week and 64 in the second, more than double. Later, the campaign showed a tremendous increase in visits, attracting 613 visitors in the last (30th) week.

2. Bartledan Campaign

Bartledan had 128 visits during the first week, which is very high compared with Aldebaran. During the third week it came down to 99, then increased very slowly, reaching only 247 visits in the last week.

3. Cottington campaign

The Cottington campaign had 149 visits in the first week, higher than both Aldebaran and Bartledan. By the last week, the visits had decreased to 125, the lowest among all three.

1.2 Visualize the cost spent on each campaign

p <- plot_ly(df, x = ~Campaign, y=~Cost, color = ~Campaign)
p
From the chart above we can see that the cost spent on the Cottington campaign is very high, followed by Bartledan and Aldebaran. The total cost spent on Aldebaran is very low compared with the other two campaigns.

1.3 Visualize the revenue generated by each campaign

p <- plot_ly(df, x = ~Campaign, y=~Revenue, color = ~Campaign)
p

From the chart above we can see that the Cottington campaign generated the most revenue, followed by Bartledan and Aldebaran.

1.4 Visualize the profit for each campaign, but first, we have to calculate the profit.

Profit = Revenue - Cost

prof <- df %>% 
  mutate(Profit = Revenue - Cost)
# Check which campaign got the maximum and minimum profit
p <- plot_ly(prof, x = ~Campaign, y=~Profit, color = ~Campaign)
p

The profit chart shows that the Cottington campaign achieved the highest profit of all.

Aldebaran made some profit towards the end, but it is almost break-even.

Bartledan is at a loss and is the worst of all.
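To put numbers on the chart, the total profit per campaign can be computed from the `prof` data frame built above (a sketch, assuming dplyr is loaded):

```r
# Total profit per campaign over the 30 weeks
prof %>%
  group_by(Campaign) %>%
  summarise(TotalProfit = round(sum(Profit), 1))
```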

Problem Statement (b): How would you assess the development of the quality of traffic, e.g. in terms of revenue per visitor. How is the overall development and how does each campaign evolve?

To answer this we have to create a few metrics and visualizations. We will calculate Revenue per Visitor, Cost per Visitor, Profit per Visitor and the overall campaign profit. This will give us a clearer picture of each campaign.

1.5 Calculate Revenue Per Visitor

Revenue per Visitor (RPV) measures the amount of money generated each time a customer visits the website. It is calculated by dividing the total revenue by the total number of visitors, and it is a method of estimating the value of each additional visitor. Since we have weekly data, we will calculate the weekly revenue per visitor.
# Calculate RPV using the formula RPV = Revenue / Visits
rpv <- prof %>% 
  mutate(RPV = Revenue / Visits)
# Visualize Weekly Revenue Per Visitor using Line Chart
p <- plot_ly(rpv, x = ~Week, y=~RPV, color = ~Campaign, type = 'scatter', 
             mode = 'lines')
p

We can see that the Cottington campaign generated the maximum revenue per visitor, 2.85, followed by Bartledan at 1.65 and Aldebaran at 0.42. When we visualized the visits per week, we saw that Cottington had 149 visits in the first week, higher than both Aldebaran and Bartledan, while in the last week the visits decreased to 125, the lowest among all. Even with fewer visitors, the Cottington campaign achieved the highest RPV of the three campaigns.

Revenue per visitor alone does not give the full picture of a campaign; the cost spent on the campaign is also a very important factor. As we have already seen, the cost spent on Cottington and Bartledan was very high.
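One way to combine both views is the overall revenue per visitor per campaign, total revenue divided by total visits, which weights each week by its traffic instead of averaging the weekly RPV values (a sketch using the `df` loaded above):

```r
# Overall RPV per campaign: sum of revenue over sum of visits
df %>%
  group_by(Campaign) %>%
  summarise(overall_RPV = sum(Revenue) / sum(Visits))
```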

1.6 Calculate the Cost Per Visitor.

Now we will calculate the cost per visitor in the same way we calculated revenue per visitor.

# Calculate CPV using the formula CPV = Cost / Visits
cpv <- prof %>% 
  mutate(CPV = Cost / Visits)
# Visualize Weekly Cost Per Visitor using a Line Chart
p <- plot_ly(cpv, x = ~Week, y=~CPV, color = ~Campaign, type = 'scatter', 
             mode = 'lines')
p

As the plot above shows, the cost spent per visitor is very high for Cottington (3.11) and Bartledan (1.78). For Aldebaran it is very low compared with the other two, at 0.34. This metric tells us the cost of attracting one visitor; it should be as low as possible to generate more profit per visitor.

1.7 Calculate Profit Per Visitor.

# Calculate PPV using the formula PPV = Profit / Visits
ppv <- prof %>% 
  mutate(PPV = Profit / Visits)
# Visualize Weekly Profit Per Visitor using a Line Chart
p <- plot_ly(ppv, x = ~Week, y=~PPV, color = ~Campaign, type = 'scatter', 
             mode = 'lines')
p

Now we can clearly see the development of traffic quality, and it looks quite different from Revenue per Visitor. The plot shows that Cottington and Bartledan actually generate no profit; instead, they are at a loss. The Aldebaran campaign became profitable after the 20th week and reached the highest profit of 0.07 per visitor, far better than the other two campaigns. The plot shows that Bartledan is at a loss of 0.13 per visitor and Cottington at a loss of 0.26 per visitor.

1.8 Let’s calculate the Overall Campaign Profit

over_profit <- rpv%>% 
  select(Week, Campaign, Revenue, Cost) %>% 
  pivot_longer(cols = Revenue:Cost, names_to ="CostRevenue" ,values_to = "Value") %>% 
  group_by(Campaign,CostRevenue) %>% 
  summarise(Value = round(sum(Value),1))
over_profit
# Plot the overall profit
p <- plot_ly(over_profit, x = ~Campaign, y=~Value, color = ~CostRevenue)
p

The overall campaign profit shows that Cottington generated the highest profit, Bartledan is at a loss, and Aldebaran is almost break-even.

The overall summary of the campaigns:

Aldebaran:

Aldebaran started at a loss. The campaign had very low visits in the first week, but in the second week the visits more than doubled.
The revenue generated per visitor increased 2.5-fold from week 1 to week 30.
After week 20 the campaign showed a tremendous increase in visits, attracting ever more visitors, and in the end it had the best quality of traffic.

Bartledan:

Bartledan had 128 visits during the first week. During the third week it went down to 99, then increased very slowly, reaching only 247 visits in the last week.
The cost spent on this campaign was the second-highest, and according to Profit per Visitor it started with the highest loss of all and continued that way until week 26. Even after 26 weeks, it couldn’t generate much profit.
The revenue generated per visitor increased by nearly 1.6 times from week 1 to week 30.
The smallest loss it incurred, 7.28, was in the 4th week. This campaign was the worst of all.

Cottington:

The Cottington campaign had the highest visits, 149, in the first week. In the last week, the visits decreased to 125, the lowest of all. Even with fewer visitors, Cottington achieved the highest RPV of the three campaigns.
The increase in revenue per visitor was steady from week 1 to week 29; in week 30 there was an unusual jump in revenue, which coincided with an unusual increase in the cost spent.
The revenue generated per visitor increased by nearly 1.5 times from week 1 to week 30.
Cottington was the only campaign that generated profit from week 1, but it gradually started falling and went on to generate the largest weekly losses.
After week 20 it could no longer generate profit.

1.9 Advise in which campaign to invest and why

Based on our findings, I would advise the business developer to invest in the Aldebaran campaign. The cost spent on Aldebaran was very low, yet it still generated a profit towards the end, and its profit per visitor was the best of the three. Spending more on Aldebaran is therefore the most likely way to generate additional profit.
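A rough back-of-the-envelope projection of the extra 250€ per week, under the strong assumption that profit scales linearly with spend, can be derived from each campaign’s observed profit per euro of cost (a sketch using `prof` from above):

```r
# Profit earned per euro spent, and the implied return on an extra 250 EUR/week
prof %>%
  group_by(Campaign) %>%
  summarise(ProfitPerEuro = sum(Profit) / sum(Cost),
            ExtraWeeklyProfit = round(250 * ProfitPerEuro, 1))
```

In practice returns usually diminish with additional spend, so this should be read as an optimistic upper bound, not a forecast.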

1.10 How do you expect this to impact the overall performance in the market from week 31 onwards?

To answer this, let’s try to model campaign profit using a regression model and see if it works.

model1 <- lm(Profit ~ Campaign, data = rpv)
summary(model1)
## 
## Call:
## lm(formula = Profit ~ Campaign, data = rpv)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.473  -9.864   1.269  14.199  46.555 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.9631     3.6803   0.262   0.7942    
## CampaignBartledan  -27.5866     5.2047  -5.300 8.64e-07 ***
## CampaignCottington  11.2372     5.2047   2.159   0.0336 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.16 on 87 degrees of freedom
## Multiple R-squared:  0.4038, Adjusted R-squared:  0.3901 
## F-statistic: 29.47 on 2 and 87 DF,  p-value: 1.693e-10

With reference to this model summary, the residual standard error is 20.16, which should be as small as possible (a value of 0 would denote perfect prediction). The multiple R-squared is 0.4038, which means the model explains roughly 40% of the variance in profit.

We can see that CampaignCottington shows a significant positive estimate of approximately 11.24 relative to the baseline (Aldebaran), while CampaignBartledan shows a significant negative estimate of -27.59, which means investing in the Bartledan campaign is associated with a loss.

For Aldebaran (the intercept, 0.96) there is almost no profit and no loss.

Cottington may well continue to generate profit after week 31, but it is worth noting that Aldebaran also did very well: the cost spent on it was much lower, and it generated a profit as well, small but still a profit.

Aldebaran is progressing and might generate more profit in the coming weeks; the growth would be slow, but the required investment is also lower than for the other campaigns.
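The model above ignores the time trend, which is exactly what favours Aldebaran. A sketch that gives each campaign its own weekly trend (assuming the trends stay roughly linear) and extrapolates to week 31, using the `rpv` data frame built above:

```r
# Profit as a function of week, with a separate slope per campaign
model2 <- lm(Profit ~ Week * Campaign, data = rpv)
# Extrapolated weekly profit at week 31 for each campaign
predict(model2, newdata = data.frame(
  Week = 31,
  Campaign = c("Aldebaran", "Bartledan", "Cottington")))
```

Extrapolating one week beyond the data is reasonable; extrapolating much further with a linear trend would not be.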

2.0 Session Data

## Avoidance of scientific numbers
options(scipen = 999)
# Import the session dataset
df_sessions <- read.csv("session_data.csv",header=TRUE,stringsAsFactors = F, sep=",")
summary(df_sessions)
##     session               session_start_text session_end_text     clickouts    
##  Min.   :20170500000000   Length:10000       Length:10000       Min.   :0.000  
##  1st Qu.:20170500000000   Class :character   Class :character   1st Qu.:2.000  
##  Median :20170500000000   Mode  :character   Mode  :character   Median :2.000  
##  Mean   :20170500000000                                         Mean   :2.485  
##  3rd Qu.:20170500000000                                         3rd Qu.:3.000  
##  Max.   :20170500000000                                         Max.   :8.000  
##     booking      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.0967  
##  3rd Qu.:0.0000  
##  Max.   :1.0000

Data Exploration and Preparation

# Inspect the dataset
str(df_sessions) 
## 'data.frame':    10000 obs. of  5 variables:
##  $ session           : num  20170500000000 20170500000000 20170500000000 20170500000000 20170500000000 ...
##  $ session_start_text: chr  "06:11:53" "21:06:41" "12:03:01" "05:58:00" ...
##  $ session_end_text  : chr  "06:15:11" "21:08:23" "12:06:02" "06:02:56" ...
##  $ clickouts         : int  3 3 3 0 1 4 3 2 2 4 ...
##  $ booking           : int  0 0 0 0 0 0 0 1 0 0 ...
# The time columns need a proper time format; they are characters here.

dim(df_sessions) 
## [1] 10000     5
# shows the top 6 obs
head(df_sessions) 
# shows the last 6 obs
tail(df_sessions) 

Before proceeding with the session dataset, we need to convert the times to a proper format. For this we will use the package “hms”, which provides a simple class for storing durations or time-of-day values and displaying them in the hh:mm:ss format.

library(hms)
df_sessions$session_start_text <- as_hms(df_sessions$session_start_text)
df_sessions$session_end_text <- as_hms(df_sessions$session_end_text)
str(df_sessions)
## 'data.frame':    10000 obs. of  5 variables:
##  $ session           : num  20170500000000 20170500000000 20170500000000 20170500000000 20170500000000 ...
##  $ session_start_text: 'hms' num  06:11:53 21:06:41 12:03:01 05:58:00 ...
##   ..- attr(*, "units")= chr "secs"
##  $ session_end_text  : 'hms' num  06:15:11 21:08:23 12:06:02 06:02:56 ...
##   ..- attr(*, "units")= chr "secs"
##  $ clickouts         : int  3 3 3 0 1 4 3 2 2 4 ...
##  $ booking           : int  0 0 0 0 0 0 0 1 0 0 ...

2.1 Now we will calculate the time duration of a session in minutes (between session start and session end)

df_sessions <- df_sessions %>%
  mutate(duration = 
           round(as.numeric(( session_end_text - session_start_text)/60)))

head(df_sessions)

2.2 We can see that the column duration has been generated and shows the session duration in minutes.

Let’s count the number of bookings for 0 and 1

df_sessions %>%
  group_by(booking) %>%
  summarise(count = n())

2.3 Let’s count the number of clickouts

df_sessions %>%
  group_by(clickouts) %>%
  summarise(count = n())

As we can see, most sessions have 2 or 3 clickouts.

2.4 Let’s count the durations

df_sessions %>%
  group_by(duration) %>%
  summarise(count = n())

The durations above include negative values. This is not a bug: these sessions started before midnight and ended after it, so subtracting the start time from the end time wraps around the 24-hour clock. Adding 24 hours (1440 minutes) to a negative duration recovers the true elapsed time.

For example

-1438 + 1440 = 2 minutes
df_sessions$duration[df_sessions$duration==-1438] = 2
df_sessions$duration[df_sessions$duration==-1437] = 3
df_sessions$duration[df_sessions$duration==-1436] = 4
df_sessions$duration[df_sessions$duration==-1435] = 5
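Hard-coding each negative value only works for the values present in this file. Since every negative duration comes from the same midnight wrap-around, a vectorized sketch adds 1440 minutes to all of them at once:

```r
# Sample durations in minutes (illustrative values); negatives crossed midnight
durations <- c(3, -1438, 5, -1436)
fixed <- ifelse(durations < 0, durations + 1440, durations)
fixed   # 3 2 5 4

# Applied to the session data (equivalent to the assignments above):
# df_sessions$duration <- ifelse(df_sessions$duration < 0,
#                                df_sessions$duration + 1440,
#                                df_sessions$duration)
```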

2.5 Now we can check if it’s showing correct or not

df_sessions %>%
  group_by(duration) %>%
  summarise(count = n())

We can see that it now shows the correct values.

2.6 Visualize the session dataset for the distribution of booking with clickouts and duration using ggplot.

df_sessions %>%
  group_by(booking, clickouts) %>%
  summarise(count = n()) %>%
  ggplot() + geom_bar(aes(x = clickouts, y = count, 
                          fill = booking), stat = "identity") + 
  guides(fill = guide_legend(title = "Booking")) + 
  facet_grid(. ~ booking) +
  labs(x = "clickouts", y = "booking", 
       title = "Booking w.r.t Clickouts")

As we can see in the plot above, the bookings are roughly bell-shaped with respect to clickouts, and most bookings happen at 2 or 3 clickouts.

2.7 Now let’s check for the time duration.

df_sessions %>%
  group_by(booking, duration) %>%
  summarise(count = n()) %>%
  ggplot() + geom_bar(aes(x = duration, y = count, 
                          fill = booking), stat = "identity") + 
  guides(fill = guide_legend(title = "Booking")) + 
  facet_grid(. ~ booking) +
  labs(x = "Session Minutes", y = "Booking", 
       title = "Booking w.r.t Duration")

This plot shows a similar, roughly bell-shaped distribution of bookings with respect to session duration. We can see that most bookings are made in sessions of about 3 minutes.

It now looks as if there might be a relation between booking and duration, and between booking and clickouts. We can run a hypothesis test to check these relationships.

2.8 But first, let’s check the correlation between these variables using a correlation matrix, and see what it says before doing the hypothesis test.

## Generate the correlation matrix
cor <- cor(df_sessions %>% select(booking, clickouts, duration))
cor
##               booking   clickouts   duration
## booking    1.00000000 -0.04981168 0.01131174
## clickouts -0.04981168  1.00000000 0.04017565
## duration   0.01131174  0.04017565 1.00000000
plot(cor)

The correlation matrix above shows almost no linear relation between the variables. That is interesting: the distribution plots suggested a relation, yet the correlations are close to zero. A hypothesis test will help verify whether there is any relation at all.

2.9 Hypothesis Test

In our case the null hypothesis(H0) and the alternative hypothesis(H1) would be:

H0: There is no relation between booking and duration/clickouts.
H1: There is a relation between booking and duration/clickouts.
model = glm(booking ~ duration + clickouts, data = df_sessions
                  , family = binomial(link = "logit"))
summary(model)
## 
## Call:
## glm(formula = booking ~ duration + clickouts, family = binomial(link = "logit"), 
##     data = df_sessions)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5741  -0.4651  -0.4395  -0.4137   2.3807  
## 
## Coefficients:
##             Estimate Std. Error z value             Pr(>|z|)    
## (Intercept) -1.97323    0.12382 -15.936 < 0.0000000000000002 ***
## duration     0.04230    0.03138   1.348                0.178    
## clickouts   -0.16154    0.03214  -5.027          0.000000499 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6355.4  on 9999  degrees of freedom
## Residual deviance: 6328.7  on 9997  degrees of freedom
## AIC: 6334.7
## 
## Number of Fisher Scoring iterations: 5

The p-value for duration is 0.178, which is greater than the significance level alpha = 0.05, so we fail to reject the null hypothesis: we have no evidence of a relation between booking and duration.

The p-value for clickouts is 0.000000499 (4.99 × 10^-7), which is smaller than alpha = 0.05, so we reject the null hypothesis: there is sufficient evidence of a relation between booking and clickouts. Note that the coefficient is negative (-0.16), so more clickouts are associated with a lower probability of booking.
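Logistic-regression coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are easier to interpret (`exp(coef(model))` would report the same numbers from the fitted object). Using the estimates printed above:

```r
# Odds ratio per additional clickout: about 0.85, i.e. ~15% lower booking odds
exp(-0.16154)
# Odds ratio per additional minute of session time: about 1.04 (not significant)
exp(0.04230)
```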